MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI
نویسندگان
چکیده
High performance computing platforms like Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message passing library in HPC applications. These two trends raise the need for fault tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault tolerance protocols for MPI applications. We present an extensive related work section highlighting the originality of our approach and the proposed protocols. We present then four fault tolerant protocols implemented in a new generic framework for fault tolerant protocol comparison, covering a large spectrum of known approaches from coordinated checkpoint, to uncoordinated checkpoint associated with causal message logging. We measure the performance of these protocols on a microbenchmark and compare them for the NAS benchmark, using an original fault tolerance test. Finally, we outline the lessons learned from this in depth fault tolerant protocol comparison for MPI applications.
منابع مشابه
Automatic Fault - Tolerant MPI
High performance computing platforms such as Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault-tolerant protocols for MPI applicati...
متن کاملMPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes
Fault-tolerance is an essential element to the distributed system which requires the reliable computation environment. In spite of extensive researches over two decades, practical fault-tolerance systems have not been provided. It is due to the high overhead and the unhandiness of the previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system f...
متن کاملIn-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes
Today, the scale of High performance computing (HPC) systems is much larger than ever. This brings a challenge to fault tolerance of HPC systems. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, FT-MPI and so on. Most of them are based on on-disk checkpointing. In this pape...
متن کاملHigh Performance Broadcast Support in La-Mpi Over Quadrics
LA-MPI is a unique MPI implementation that provides network-level fault-tolerant message passing. This paper describes the efficient implementation of a scalable MPI broadcast algorithm. LA-MPI implements a generic version of the broadcast algorithm using a spanning tree method built on top of point-to-point messaging. However, the Quadrics network, with it’s hardware broadcast support, provide...
متن کاملBlocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols
A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPIs has led to the development of several fault tolerant MPI environments. Different approaches a...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- IJHPCA
دوره 20 شماره
صفحات -
تاریخ انتشار 2006